Caio Raphael

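A coalesced memory transaction is one in which the threads of a warp (a half-warp on older NVIDIA hardware) access global memory in a pattern that the hardware can service as a single transaction. The simplest way to get this is to have consecutive threads access consecutive memory addresses.
GPUs execute threads in fixed-size groups (warps on NVIDIA hardware, wavefronts on AMD) that issue memory instructions together.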
If threads in a group load adjacent addresses, the hardware can merge requests into fewer memory transactions (coalescing).
Non-sequential or strided accesses increase transactions and reduce effective bandwidth.
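The effect of stride on the number of transactions can be estimated with a small model. The sketch below is not a description of any specific GPU: it assumes a 32-thread warp, 4-byte elements, and 128-byte aligned segments, and the name count_transactions is made up for illustration. It counts how many distinct segments a warp touches when thread t reads element t * stride.

```python
# Simplified model: a warp's load is split into one transaction per
# distinct aligned memory segment that its threads touch.
WARP_SIZE = 32        # assumed number of threads per warp
SEGMENT_BYTES = 128   # assumed transaction (segment) size in bytes
ELEM_BYTES = 4        # assumed element size, e.g. one 32-bit float


def count_transactions(stride, base=0):
    """Count segments touched when thread t reads element t * stride.

    `base` is the byte address of element 0.
    """
    segments = set()
    for t in range(WARP_SIZE):
        first = base + t * stride * ELEM_BYTES
        last = first + ELEM_BYTES - 1
        # An element may straddle a segment boundary if base is unaligned.
        segments.update(range(first // SEGMENT_BYTES, last // SEGMENT_BYTES + 1))
    return len(segments)


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:2d}: {count_transactions(stride)} transaction(s)")
```

Under these assumptions a unit stride needs one transaction for the whole warp, while a stride of 32 elements needs one transaction per thread, so only 4 of every 128 bytes fetched are used.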
Memory Coalescing Techniques

Accesses are serviced at cache-line granularity: an unaligned or small scattered load can fetch a full line, or several lines, for only a few useful bytes, which increases bandwidth pressure. Designing buffer layouts for aligned, contiguous reads reduces misses and wasted traffic.
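The cost of misalignment can be worked out with simple arithmetic. The sketch below assumes a 128-byte cache line (real line sizes vary by device), and the helper names lines_fetched and efficiency are hypothetical. It computes how many lines cover a contiguous byte range and what fraction of the fetched bytes was actually requested.

```python
LINE_BYTES = 128  # assumed cache-line size in bytes


def lines_fetched(base, nbytes):
    """Number of cache lines covering the byte range [base, base + nbytes)."""
    if nbytes <= 0:
        return 0
    first_line = base // LINE_BYTES
    last_line = (base + nbytes - 1) // LINE_BYTES
    return last_line - first_line + 1


def efficiency(base, nbytes):
    """Fraction of fetched bytes that were requested (nbytes must be > 0)."""
    return nbytes / (lines_fetched(base, nbytes) * LINE_BYTES)


if __name__ == "__main__":
    for base, nbytes in ((0, 128), (64, 128), (0, 4)):
        print(base, nbytes, lines_fetched(base, nbytes), efficiency(base, nbytes))
```

In this model an aligned 128-byte read uses every fetched byte, the same read shifted by 64 bytes fetches two lines and wastes half the traffic, and an isolated 4-byte load uses only 4 of the 128 bytes brought in.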

On-chip shared memory is divided into banks. When several threads in a group access different addresses that map to the same bank, the accesses serialize. Layout transforms (padding or transposing) can avoid such conflicts.
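Why padding helps can be seen in a small model of bank mapping. The sketch assumes 32 banks, each one 4-byte word wide, with word i living in bank i mod 32, and it assumes that threads reading the same word are served together without a conflict. These are common figures but not universal, and the function names are made up for illustration.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumed number of banks (one 4-byte word wide each)
WARP_SIZE = 32   # assumed number of threads accessing together


def conflict_degree(word_indices):
    """Largest number of distinct words that map to a single bank.

    1 means conflict-free; n means the access is serialized n ways.
    Identical words are counted once (modeled as a broadcast).
    """
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % NUM_BANKS].add(w)
    return max(len(words) for words in words_per_bank.values())


def column_access(row_pitch, column=0):
    """Word indices when thread t reads tile[t][column] of a 2D tile."""
    return [t * row_pitch + column for t in range(WARP_SIZE)]


if __name__ == "__main__":
    print("pitch 32:", conflict_degree(column_access(32)))  # unpadded tile
    print("pitch 33:", conflict_degree(column_access(33)))  # one padding word per row
```

With a row pitch of 32 words every thread in a column access lands in the same bank, so the access serializes 32 ways. Padding each row by one word (pitch 33) shifts each row by one bank and spreads the same access over all 32 banks.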

Sampled image (texture) accesses can go through specialized caches that assume 2D spatial locality, unlike raw buffer loads; the image's memory layout (tiling) influences how efficiently those caches are used.
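The influence of tiling on cache efficiency can be illustrated by counting the cache lines that a small 2D footprint touches under two layouts. Real texture layouts are hardware-specific and generally undocumented, so everything here is an assumption: 4-byte texels, 64-byte lines, a 1024-texel-wide image, and a hypothetical 4x4 tiled layout in which each tile occupies exactly one line.

```python
LINE_TEXELS = 16   # assumed: 64-byte cache line / 4-byte texel
TILE = 4           # hypothetical 4x4-texel tile (16 texels = one line)
WIDTH = 1024       # assumed image width in texels, a multiple of TILE


def row_major_index(x, y):
    """Texel index when rows are stored one after another."""
    return y * WIDTH + x


def tiled_index(x, y):
    """Texel index when each 4x4 tile is stored contiguously."""
    tiles_per_row = WIDTH // TILE
    tile = (y // TILE) * tiles_per_row + (x // TILE)
    return tile * TILE * TILE + (y % TILE) * TILE + (x % TILE)


def lines_touched(index_fn, x0, y0, w, h):
    """Distinct cache lines covering the w x h footprint at (x0, y0)."""
    return len({index_fn(x, y) // LINE_TEXELS
                for y in range(y0, y0 + h)
                for x in range(x0, x0 + w)})


if __name__ == "__main__":
    print("row-major:", lines_touched(row_major_index, 8, 8, 4, 4))
    print("tiled:    ", lines_touched(tiled_index, 8, 8, 4, 4))
```

For a tile-aligned 4x4 footprint the row-major layout touches four lines (one per image row) while the tiled layout touches one. A footprint that straddles tile boundaries loses part of this advantage, so the gain depends on how the access pattern lines up with the tiles.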